Translating User-Generated Content in the Social Networking Space

Authors

  • Jie Jiang
  • Andy Way
  • Rejwanul Haque
Abstract

This paper presents a case study of work done by Applied Language Solutions (ALS) for a large social networking provider that claims to have built the world’s first multi-language social network, where Internet users from all over the world can communicate in any of the languages available in the system. In an initial phase, the social networking provider contracted ALS to build Machine Translation (MT) engines for twelve language pairs: Russian⇔English, Russian⇔Turkish, Russian⇔Arabic, Turkish⇔English, Turkish⇔Arabic and Arabic⇔English. All of the input data is user-generated content, so we faced a number of problems in building large-scale, robust, high-quality engines. Primarily, much of the source-language data is of ‘poor’ or at least ‘non-standard’ quality. This comes in many forms: (i) content produced by non-native speakers, (ii) content produced by native speakers containing non-deliberate typos, or (iii) content produced by native speakers which deliberately departs from spelling norms to bring about some linguistic effect. Accordingly, in addition to the ‘regular’ preprocessing techniques used in the building of our statistical MT systems, we needed to develop routines to deal with all these scenarios. In this paper, we describe how we handle short forms, acronyms, typos, punctuation errors, non-dictionary slang, wordplay, censor avoidance and emoticons. We report automatic evaluation scores on the social network data, together with insights from the social networking provider regarding some of the typical errors made by the MT engines, and how we managed to correct these in the engines.
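The abstract does not show the normalization routines themselves. As a rough illustration of the kind of preprocessing it describes (short forms, punctuation errors, emoticons), the following is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `SHORTFORMS` table, the `EMOTICON` pattern, the `__EMO__` placeholder and the `normalize` function are hypothetical and are not taken from the paper or from the ALS engines.

```python
import re

# Hypothetical lookup table (assumption); a real system would need a much
# larger, language-specific list.
SHORTFORMS = {"u": "you", "r": "are", "gr8": "great", "thx": "thanks"}

# Small illustrative emoticon pattern (assumption, far from exhaustive).
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")


def normalize(text: str) -> str:
    """Normalize one user-generated message before passing it to an MT engine."""
    # 1. Swap emoticons for a placeholder so later steps leave them untouched.
    emoticons = EMOTICON.findall(text)
    text = EMOTICON.sub(" __EMO__ ", text)
    # 2. Collapse runs of the same punctuation mark ("!!!" becomes "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # 3. Squeeze characters repeated three or more times down to two.
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    # 4. Expand short forms token by token (case-insensitive lookup).
    tokens = [SHORTFORMS.get(tok.lower(), tok) for tok in text.split()]
    # 5. Put the original emoticons back in their original order.
    restored = iter(emoticons)
    tokens = [next(restored) if tok == "__EMO__" else tok for tok in tokens]
    return " ".join(tokens)


print(normalize("thx u r gr8 :)"))
print(normalize("sooooo gooood!!!"))
```

A placeholder is used for emoticons so that the punctuation and repetition rules cannot damage them; typo correction, slang, wordplay and censor avoidance would need further, language-specific resources that this sketch does not attempt.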


Similar resources

Implications of User Generated Content on Facebook

The purpose of this study is to examine the implications (user benefits and costs) for individual users of user-generated content posted on Facebook. Although motivations to use social networking sites are widely researched and published, studies on the implications of information on social networking sites are sparse. Hence, this study addresses this gap by an interpretive analysis of user ...


Mining User Profiles to Support Structure and Explanation in Open Social Networking

The proliferation of media sharing and social networking websites has brought with it vast collections of site-specific user generated content. The result is a Social Networking Divide in which the concepts and structure common across different sites are hidden. The knowledge and structures from one social site are not adequately exploited to provide new information and resources to the same or...


Automatic Hashtag Recommendation in Social Networking and Microblogging Platforms Using a Knowledge-Intensive Content-based Approach

In social networking/microblogging environments, a #tag is often used for categorizing messages and marking their key points. Also, since some social networks such as Twitter restrict the number of characters in messages, #tags can serve as a useful tool for helping users express their messages. In this paper, a new knowledge-intensive content-based #tag recommendation system is intr...


Social Networking Websites - A Concatenation of Impersonation, Denigration, Sexual Aggressive Solicitation, Cyber-Bullying or Happy Slapping Videos

Hands-off legislation, toothless policy statements, unknowing parents, uncaring participants, and unwilling social network intermediaries (SNIs) have conspired to invite impersonation, denigration, sexual or aggressive solicitation, cyber-bullying, and happy slapping to the members of most social networking websites (SNWs). The situation is serious because the user-generated content (U...


Your Members Are Also Your Customers: Marketing for Internet Social Networks

Perhaps the fastest growing arena in the World Wide Web is the space of social networking sites (e.g., Friendster, Facebook, MySpace). The success of these sites directly depends on the number and activity level of their users. What attracts users to a site is continually changing digital content (e.g., messages, pictures, photos, music, videos, blogs) generated by other users. In contrast,...




Publication date: 2012